Contrastive learning has been successfully used for retrieval of semantically aligned sentences, but it often requires large batch sizes or careful engineering to work well. In this paper, we instead propose a generative model for learning multilingual text embeddings which can be used to retrieve or score sentence pairs. Our model operates on parallel data in $N$ languages and, through an approximation we introduce, efficiently encourages source separation in this multilingual setting, separating semantic information that is shared between translations from stylistic or language-specific variation. We show careful large-scale comparisons between contrastive and generation-based approaches for learning multilingual text embeddings, a comparison that has not been done to the best of our knowledge despite the popularity of these approaches. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval -- the last of which we introduce in this paper. Overall, our Variational Multilingual Source-Separation Transformer (VMSST) model outperforms both a strong contrastive and generative baseline on these tasks.
Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
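The query-conditioning interface described above can be sketched in a few lines. This is a toy illustration, not CLIPSep's actual architecture: all shapes, names, and the linear-mask parameterization are assumptions. The key idea it shows is that a single query embedding (from CLIP's image tower at training time, or its text tower at test time) conditions a mask applied to the mixture.

```python
import math

def query_conditioned_mask(mixture, query, weights):
    """Toy sketch of query-conditioned separation (hypothetical names
    and shapes, not CLIPSep's real model): a query embedding drives a
    sigmoid mask over the mixture's frequency bins, and masking the
    mixture yields the estimated target sound."""
    mask = []
    for w_f in weights:  # one weight row per frequency bin
        logit = sum(w * q for w, q in zip(w_f, query))
        mask.append(1.0 / (1.0 + math.exp(-logit)))  # mask value in [0, 1]
    return [m * x for m, x in zip(mask, mixture)]
```

Because the conditioning signal is just a vector in CLIP's joint embedding space, swapping the image encoder for the text encoder at test time requires no change to the separation model itself.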
User-generated social media data is constantly changing as new trends influence online discussion, causing test-data distribution shift for social media NLP applications. In addition, training data often changes as user data is deleted. Most current NLP systems are static and rely on fixed training data; as a result, they cannot adapt to temporal change -- both test distribution shift and training-data deletion -- without frequent, costly retraining. In this paper, we study temporal adaptation through the task of longitudinal hashtag prediction and propose a nonparametric technique as a simple but effective solution: nonparametric classifiers use a datastore that can be updated to adapt to test distribution shift or training-data deletion, without retraining. We release a new benchmark dataset consisting of 7.13M tweets from 2021, along with their hashtags, divided into consecutive temporal buckets. We compare parametric neural hashtag classification and hashtag-generation models, which require retraining to adapt, against a nonparametric, training-free dense retrieval method that returns the hashtags of the nearest neighbors by text-embedding distance. In experiments on our longitudinal Twitter dataset, we find that dense nearest-neighbor retrieval achieves a 64.12% relative performance gain over the best parametric baseline on test sets that exhibit distribution shift, without requiring gradient-based retraining. Moreover, we show that our datastore approach is particularly well suited to dynamically deleted user data, incurring negligible computational cost and performance loss. Our novel benchmark dataset and empirical analysis can support future research on these important challenges that real-world user data raises for deployed AI systems.
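The datastore idea above is simple enough to sketch end to end. The class and method names below are hypothetical, and a real system would use learned text embeddings and an approximate-nearest-neighbor index, but the sketch shows the core property: adapting to distribution shift or honoring a deletion request is a plain add or remove on the store, with no gradient-based retraining.

```python
import math

class HashtagDatastore:
    """Minimal sketch (hypothetical API) of a nonparametric hashtag
    classifier: an updatable store of (embedding, hashtag) pairs."""

    def __init__(self):
        self.entries = []  # list of (embedding, hashtag) pairs

    def add(self, embedding, hashtag):
        # adapting to a new temporal bucket is just appending new data
        self.entries.append((list(embedding), hashtag))

    def delete_hashtag(self, hashtag):
        # honoring a deletion request: drop matching entries, no retraining
        self.entries = [(e, h) for e, h in self.entries if h != hashtag]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def predict(self, embedding, k=1):
        # return the hashtags of the k nearest neighbors by cosine similarity
        ranked = sorted(self.entries,
                        key=lambda e: self._cosine(e[0], embedding),
                        reverse=True)
        return [h for _, h in ranked[:k]]
```

In practice the linear scan in `predict` would be replaced by an ANN index, but the update and delete semantics are unchanged.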
In this work, we present a new approach to the task of predicting fingerings for piano music. While prior neural approaches have often treated this as a sequence-tagging problem with independent predictions, we propose a checklist system trained via reinforcement learning that maintains a representation of recent predictions in addition to its hidden state, allowing it to learn soft constraints on the output structure. We also demonstrate that by modifying the input representation -- which in prior neural work has typically encoded notes as one-hot encodings of individual piano keys -- to encode prior notes by their relative position on the keyboard, we can achieve better performance. In addition, we reassess the use of raw labelwise accuracy as an evaluation metric, noting that it fails to adequately measure the fluency, i.e., human playability, of a model's output. To this end, we compare methods on several statistics that track how often adjacent predicted fingerings, while individually reasonable, would be physically challenging to play in sequence, and we implement reinforcement-learning strategies to minimize these as part of the training loss. Finally, through human expert evaluation, we demonstrate significant gains in playability attributable to improvements on these metrics.
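A playability statistic of the kind described above can be sketched as a simple count over adjacent predictions. This is a hedged illustration with an assumed convention (right-hand fingers numbered 1-5, thumb = 1); the paper's actual statistics may differ. It flags adjacent note pairs where pitch and finger number move in opposite directions without the thumb being involved: each label is locally plausible, but the transition is physically awkward.

```python
def awkwardness_penalty(pitches, fingers):
    """Hypothetical fluency statistic: count adjacent note pairs where
    the pitch rises while the assigned finger number falls (or vice
    versa) with no thumb crossing involved -- transitions that are hard
    to play in sequence even if each label is reasonable on its own."""
    penalty = 0
    for (p0, f0), (p1, f1) in zip(zip(pitches, fingers),
                                  zip(pitches[1:], fingers[1:])):
        if 1 in (f0, f1):  # thumb crossings are legitimate technique
            continue
        if (p1 > p0 and f1 < f0) or (p1 < p0 and f1 > f0):
            penalty += 1
    return penalty
```

A statistic like this can serve both as an evaluation measure and, negated, as a reward term for the reinforcement-learning objective.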
Existing approaches for generating multitrack music with transformer models have been limited to either a small set of instruments or short music segments. This is partly due to the memory requirements of the lengthy input sequences necessitated by existing representations for multitrack music. In this work, we propose a compact representation that can accommodate a diverse set of instruments while keeping sequence lengths short. Using our proposed representation, we present the Multitrack Music Transformer (MTMT) for learning long-range dependencies in multitrack music. In a subjective listening test, our proposed model achieves competitive quality on unconditioned generation against two baseline models. We also show that our proposed model can generate longer samples than the baseline models, and can do so in half the inference time. Moreover, we propose a new measure for analyzing musical self-attention and show that the trained model learns to attend less to notes that form a dissonant interval with the current note, yet attends more to notes that are 4N beats away from the current note. Finally, our findings provide a novel foundation for future work exploring longer-form multitrack music generation and improving self-attention for music. All source code and audio samples can be found at https://salu133445.github.io/mtmt/ .
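The sequence-length argument above can be made concrete with a toy comparison. The field names below are illustrative assumptions, not necessarily the paper's exact event vocabulary: a compound encoding packs one note's attributes into a single token, while a flat, REMI-style encoding spells each attribute as its own token, multiplying sequence length by the number of attributes.

```python
def compound_encode(notes):
    """Hypothetical compact encoding: one compound token per note."""
    return [(n["beat"], n["position"], n["pitch"], n["duration"], n["program"])
            for n in notes]

def flat_encode(notes):
    """A flat baseline encoding: one token per attribute per note."""
    tokens = []
    for n in notes:
        tokens += [("beat", n["beat"]), ("position", n["position"]),
                   ("pitch", n["pitch"]), ("duration", n["duration"]),
                   ("program", n["program"])]
    return tokens
```

With five attributes per note, the flat encoding is five times longer, which translates directly into transformer memory cost; a compound encoding trades this for a larger per-token prediction problem.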
Large language models are shown to present privacy risks through memorization of training data, and several recent works have studied such risks for the pre-training phase. Little attention, however, has been given to the fine-tuning phase and it is not well understood how different fine-tuning methods (such as fine-tuning the full model, the model head, and adapter) compare in terms of memorization risk. This presents increasing concern as the "pre-train and fine-tune" paradigm proliferates. In this paper, we empirically study memorization of fine-tuning methods using membership inference and extraction attacks, and show that their susceptibility to attacks is very different. We observe that fine-tuning the head of the model has the highest susceptibility to attacks, whereas fine-tuning smaller adapters appears to be less vulnerable to known extraction attacks.
One of the most impressive results of recent NLP history is the ability of pre-trained language models to solve new tasks in a zero-shot setting. To achieve this, NLP tasks are framed as natural language prompts, generating a response indicating the predicted output. Nonetheless, the performance in such settings often lags far behind its supervised counterpart, suggesting a large space for potential improvement. In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance. Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency, encouraging consistent predictions over this diverse set of prompts. Our method makes it possible to fine-tune the model either with extra unlabeled training data, or directly on test input at inference time in an unsupervised manner. In experiments, our approach outperforms the state-of-the-art zero-shot learner, T0 (Sanh et al., 2022), on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy. The gains are often attained with a small number of unlabeled examples.
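The consistency regularizer described above can be sketched as a pairwise divergence over the model's predictions under different prompts. This is a minimal illustration of the idea, not necessarily the paper's exact regularizer: given predicted label distributions for the same input under several prompt phrasings, penalize their pairwise (symmetrized) KL divergence.

```python
import math
import itertools

def kl(p, q):
    """KL divergence between two discrete distributions (q must be
    strictly positive wherever p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def prompt_consistency_loss(distributions):
    """Sketch of prompt-consistency regularization: average symmetrized
    KL divergence over all pairs of per-prompt label distributions for
    the same input. Zero iff all prompts agree exactly."""
    pairs = list(itertools.combinations(distributions, 2))
    return sum(kl(p, q) + kl(q, p) for p, q in pairs) / len(pairs)
```

Because the loss needs only the model's own predictions, it can be minimized on unlabeled training data or even directly on test inputs at inference time.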
The wide adoption and application of Masked language models~(MLMs) on sensitive data (from legal to medical) necessitates a thorough quantitative investigation into their privacy vulnerabilities -- to what extent do MLMs leak information about their training data? Prior attempts at measuring leakage of MLMs via membership inference attacks have been inconclusive, implying the potential robustness of MLMs to privacy attacks. In this work, we posit that prior attempts were inconclusive because they based their attack solely on the MLM's model score. We devise a stronger membership inference attack based on likelihood ratio hypothesis testing that involves an additional reference MLM to more accurately quantify the privacy risks of memorization in MLMs. We show that masked language models are extremely susceptible to likelihood ratio membership inference attacks: Our empirical results, on models trained on medical notes, show that our attack improves the AUC of prior membership inference attacks from 0.66 to an alarmingly high 0.90 level, with a significant improvement in the low-error region: at 1% false positive rate, our attack is 51X more powerful than prior work.
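The test statistic described above reduces to a simple log-likelihood ratio per sample. The helper below is a hedged sketch of that scoring scheme (the threshold-calibration helper and its name are assumptions): a sample looks like a training member when the target MLM assigns it much higher likelihood than a reference MLM trained on general-domain data, and the decision threshold can be calibrated on known non-members to hit a target false positive rate such as 1%.

```python
import math

def lr_membership_score(target_loglik, reference_loglik):
    """Likelihood-ratio statistic: high when the target model 'prefers'
    the sample far more than a reference model does."""
    return target_loglik - reference_loglik

def threshold_at_fpr(nonmember_scores, fpr=0.01):
    """Pick a decision threshold so that at most `fpr` of known
    non-members score above it (hypothetical calibration helper)."""
    s = sorted(nonmember_scores)
    idx = min(len(s) - 1, math.ceil((1 - fpr) * len(s)))
    return s[idx]
```

Using a reference model normalizes away how intrinsically likely a sample is, which is what makes this attack sharper than thresholding the target model's score alone.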
We show that a simple unsupervised masking objective can approach supervised performance on abstractive multi-document news summarization. Our method trains a state-of-the-art neural summarization model to predict the masked-out source document with the highest lexical centrality relative to the multi-document group. In experiments on the Multi-News dataset, our masked training objective yields a system that outperforms unsupervised methods and, in human evaluation, surpasses the best supervised method, without requiring access to any ground-truth summaries. In addition, we evaluate how different measures of lexical centrality, inspired by past work on extractive summarization, affect final performance.
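One concrete centrality measure of the kind mentioned above can be sketched in a few lines. This is an illustration under an assumed definition (average unigram Jaccard overlap with the other documents in the group), not necessarily one of the paper's measures; it shows how the document to mask and reconstruct would be selected.

```python
def lexical_centrality(docs):
    """Hypothetical centrality measure: each document's average unigram
    Jaccard overlap with the other documents in the group."""
    vocabs = [set(d.lower().split()) for d in docs]
    scores = []
    for i, vi in enumerate(vocabs):
        others = [v for j, v in enumerate(vocabs) if j != i]
        scores.append(sum(len(vi & v) / max(1, len(vi | v)) for v in others)
                      / max(1, len(others)))
    return scores

def most_central_doc(docs):
    """Index of the document the summarizer would be trained to
    reconstruct when it is masked out of the group."""
    scores = lexical_centrality(docs)
    return max(range(len(docs)), key=scores.__getitem__)
```

The intuition is that the most lexically central document is closest to what a summary of the group would contain, so predicting it from the remaining documents is a useful proxy for summarization.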
We present a self-supervised pre-training approach for learning rich visual-language representations for the transcription of handwritten and printed historical documents. After supervised fine-tuning of our pre-trained encoder representations for low-resource document transcription in two languages, (1) a heterogeneous set of handwritten Islamicate manuscript images and (2) early modern English printed documents, we show meaningful improvements in recognition accuracy over the same supervised model trained from scratch, with as few as 30 transcribed line images for training. Our masked language model-style pre-training strategy, in which the model is trained to identify the true masked visual representation among distractors sampled from the same line, encourages learning robust contextualized language representations that are invariant to scribal style and printing noise across documents.